A Simulation Study Comparing Two Methods Of Evaluating Differential Test Functioning (DTF): DFIT and the Mantel-Haenszel/Liu-Agresti Variance
Authors
Abstract
This study uses simulated data to compare two methods of calculating Differential Test Functioning (DTF): Raju’s DFIT, a parametric method that measures the squared difference between two Test Characteristic Curves (Raju, van der Linden & Fleer, 1995), and a variance estimator based on the Mantel-Haenszel/Liu-Agresti (MH/LA) method, a non-parametric method implemented in the DIFAS program (Penfield, 2005). Most research has been done on Differential Item Functioning (DIF; Pae & Park, 2006), and theory and empirical studies indicate that DTF is the summation of DIF in a test (Donovan, Drasgow & Probst, 2000; Ellis & Mead, 2000; Nandakumar, 1993). Perhaps because of this, measurement of DTF is under-investigated. Several reasons make the study of DTF important. From a statistical viewpoint, items, when compared to tests, are small and unreliable samples (Gierl, Bisanz, Bisanz, Boughton, & Khaliq, 2001). As an aggregate measure of DIF, DTF can present an overall view of the effect of differential functioning, even when no single item exhibits significant DIF (Shealy & Stout, 1993b). Decisions about examinees are made at the test level, not the item level (Ellis & Raju, 2003; Jones, 2000; Pae & Park, 2006; Roznowski & Reith, 1999; Zumbo, 2003). Overall, both methods performed as expected, with some exceptions. DTF tended to increase with DIF magnitude and with sample size. The MH/LA method generally showed greater rates of DTF than DFIT. It was also especially sensitive to differences in group distributions (impact), flagging them as DTF where DFIT did not. An empirical cutoff value seemed to work as a method of determining statistical significance for the MH/LA method. Plots of the MH/LA DTF indicator showed a tendency toward an F-distribution for equal reference and focal group sizes, and toward a normal distribution for unequal sample sizes. Areas for future research are identified.
INDEX WORDS: DTF, Differential Test Functioning, DFIT, Mantel-Haenszel
by Charles Vincent Hunter, Jr.
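The DFIT index described in the abstract can be illustrated with a small sketch. In the DFIT framework, DTF is commonly defined as the expected squared difference between the focal-group and reference-group Test Characteristic Curves, taken over the focal group's ability distribution. The code below is only an illustration under stated assumptions: it assumes a two-parameter logistic (2PL) model, and the item parameters, ability values, and function names are all hypothetical. It is not the simulation code used in the study.

```python
import math

def p_2pl(theta, a, b):
    """Probability of a correct response under a 2PL model (assumed here)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected total score at ability theta."""
    return sum(p_2pl(theta, a, b) for a, b in items)

def dtf_index(focal_thetas, ref_items, focal_items):
    """Mean squared TCC difference, averaged over focal-group abilities."""
    diffs = [tcc(t, focal_items) - tcc(t, ref_items) for t in focal_thetas]
    return sum(d * d for d in diffs) / len(diffs)

# Hypothetical (a, b) item parameters; only the second item differs by group.
ref_items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.5)]
focal_items = [(1.0, -0.5), (1.2, 0.4), (0.8, 0.5)]

# Hypothetical focal-group ability values standing in for estimated thetas.
focal_thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]

dtf_same = dtf_index(focal_thetas, ref_items, ref_items)
dtf_shifted = dtf_index(focal_thetas, ref_items, focal_items)
print(dtf_same, dtf_shifted)
```

With identical item parameters in both groups the index is exactly zero; shifting the difficulty of a single item makes it positive, which matches the abstract's description of DTF as an aggregate of item-level DIF.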
Similar resources
Differential item functioning procedures for polytomous items when examinee sample sizes are small
As part of test score validity, differential item functioning (DIF) is a quantitative characteristic used to evaluate potential item bias. In applications where a small number of examinees take a test, statistical power of DIF detection methods may be affected. Researchers have proposed modifications to DIF detection methods to account for small focal group examinee sizes for the case when item...
A new approach for differential item functioning detection using Mantel-Haenszel methods. The GMHDIF program.
To date, the statistical software designed for assessing differential item functioning (DIF) with Mantel-Haenszel procedures has employed the following statistics: the Mantel-Haenszel chi-square statistic, the generalized Mantel-Haenszel test and the Mantel test. These statistics permit detecting DIF in dichotomous and polytomous items, although they limit the analysis to two groups. On the con...
Alternate Cutoff Values and DFIT Tests of Measurement Invariance
Likert scales are routinely used in educational and psychological research as measures of constructs of interest. If sound scale development procedures are followed, the resulting scale can reliably and validly measure a construct. However, if a given scale is used to make comparisons among different populations of respondents (e.g., cultures; Riordan & Vandenberg, 1994), over time in longitudi...
Sensitivity of DFIT Tests of Measurement Invariance for Likert Data
Likert scales are routinely used in educational and psychological research as measures of constructs of interest. If sound scale development procedures are followed, the resulting scale can reliably and validly measure a construct. However, if a given scale is used to make comparisons among different populations of respondents (e.g., cultures; Riordan & Vandenberg, 1994), over time in longitudi...
Academic Discipline DIF in an English Language Proficiency Test
The purpose of this study was to detect differentially functioning items in the University of Tehran English Proficiency Test (UTEPT), which is a high-stakes test of English developed and administered by the Language Testing Centre of the University of Tehran. This paper is based on the answers of 400 test takers to the test. All participants earned a master's degree either in humanities or science...